NSF PAR Search | NSF Public Access Repository

CoPPar Tree: Fast and Composable Consistency at Global Scale

Yang, Xincheng; Hale, Kyle C (July 2025, IEEE)

Many distributed applications rely on the strong guarantees of sequential consistency to ensure program correctness. Replication systems or frameworks that support such applications typically implement sequential consistency by employing voting schemes among replicas. However, such schemes suffer dramatic performance loss when deployed globally due to increased long-haul message latency between replicas in separate data centers. One approach to overcome this challenge involves deploying distinct instances of a service in each geographic cluster, then loosely coupling those services. Unfortunately, the consistency guarantees of the individual replication system instances do not compose when coupled this way, sacrificing overall sequential consistency. We propose an alternative approach, the consistent, propagatable partition tree (CoPPar Tree), a data structure that spans multiple data centers and data partitions, and that realizes sequential consistency using divide-and-conquer. By leveraging the geospatial affinity of data used in global services, CoPPar Tree can localize reads and writes in a sequentially consistent manner, improving the overall performance of a sequentially consistent service deployed at global scale. Our work allows clients to access local data and fully run SMR protocols locally without additional overhead. We implemented CoPPar Tree by enhancing ZooKeeper with an extension called ZooTree, which can be deployed without changing existing ZooKeeper clusters, and which achieves a speedup of 100× for reads and up to 10× for writes over prior work.
Free, publicly-accessible full text available July 8, 2026
TrackFM: Far-out Compiler Support for a Far Memory World

https://doi.org/10.1145/3617232.3624856

Tauro, Brian R; Suchy, Brian; Campanoni, Simone; Dinda, Peter; Hale, Kyle C (April 2024, ACM)

Full Text Available
Bolt: Fast Inference for Random Forests

https://doi.org/10.1145/3528535.3531519

Romero, Eduardo; Stewart, Christopher; Li, Angela; Hale, Kyle; Morris, Nathaniel (October 2022, Proceedings of the 23rd ACM/IFIP International Middleware Conference)

Random forests use ensembles of decision trees to boost accuracy for machine learning tasks. However, large ensembles slow down inference on platforms that process each tree in an ensemble individually. We present Bolt, a platform that restructures whole random forests, not just individual trees, to speed up inference. Conceptually, Bolt maps every path in each tree to a lookup table which, if cache were large enough, would allow inference with just one memory access. When the size of the lookup table exceeds cache capacity, Bolt employs a novel combination of lossless compression, parameter selection, and bloom filters to shrink the table while preserving fast inference. We compared inference speed in Bolt to three state-of-the-art platforms: Python Scikit-Learn, Ranger, and Forest Packing. We evaluated these platforms using datasets with vision, natural language processing and categorical applications. We observed that on ensembles of shallow decision trees Bolt can run 2-14X faster than competing platforms and that Bolt's speedups persist as the number of decision trees in an ensemble increases.
Full Text Available
Modeling Speedup in Multi-OS Environments

https://doi.org/10.1109/TPDS.2021.3114984

Tauro, Brian R.; Liu, Conghao; Hale, Kyle C. (June 2022, IEEE Transactions on Parallel and Distributed Systems)

Full Text Available
Coalescent Computing

https://doi.org/10.1145/3476886.3477503

Hale, Kyle C. (August 2021, Proceedings of the ACM Asia-Pacific Workshop on Systems (APSys 2021))

Full Text Available
Isolating functions at the hardware limit with virtines

https://doi.org/10.1145/3492321.3519553

Wanninger, Nicholas C.; Bowden, Joshua J.; Shetty, Kirtankumar; Garg, Ayush; Hale, Kyle C. (March 2022, Proceedings of the 17th European Conference on Computer Systems (EuroSys 2022))

Full Text Available
The Case for an Interwoven Parallel Hardware/Software Stack

https://doi.org/10.1109/SCWS55283.2021.00017

Hale, Kyle C.; Campanoni, Simone; Hardavellas, Nikos; Dinda, Peter A. (November 2021, Proceedings of the 11th Workshop on Runtime and Operating Systems for Supercomputers)

Full Text Available
Enabling Extremely Fine-grained Parallelism via Scalable Concurrent Queues on Modern Many-core Architectures

https://doi.org/10.1109/MASCOTS53633.2021.9614292

Nookala, Poornima; Dinda, Peter; Hale, Kyle C.; Chard, Kyle; Raicu, Ioan (November 2021, Proceedings of the 29th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS '21))

Enabling efficient fine-grained task parallelism is a significant challenge for hardware platforms with increasingly many cores. Existing techniques do not scale to hundreds of threads due to the high cost of synchronization in concurrent data structures. To overcome these limitations we present XQueue, a novel lock-less concurrent queuing system with relaxed ordering semantics that is geared towards realizing scalability up to hundreds of concurrent threads. We demonstrate the scalability of XQueue using microbenchmarks and show that XQueue can deliver concurrent operations with latencies as low as 110 cycles at scales of up to 192 cores (up to 6900× improvement compared to traditional synchronization mechanisms) across our diverse hardware, including x86, ARM, and Power9. The reduced latency allows XQueue to provide orders of magnitude (3300×) better throughput than existing techniques. To evaluate the real-world benefits of XQueue, we integrated XQueue with LLVM OpenMP and evaluated five unmodified benchmarks from the Barcelona OpenMP Task Suite (BOTS) as well as a graph traversal benchmark from the GAP benchmark suite. We compared the XQueue-enabled LLVM OpenMP implementation with the native LLVM and GNU OpenMP versions. Using fine-grained task workloads, XQueue can deliver 4× to 6× speedup compared to native GNU OpenMP and LLVM OpenMP in many cases, with speedups as high as 116× in some cases.
Full Text Available
Playing Fetch with CAT: Composing Cache Partitioning and Prefetching for Task-based Query Processing

https://doi.org/10.1145/3465998.3466016

Zeng, Qitian; Hale, Kyle C.; Glavic, Boris (June 2021, Proceedings of the 17th International Workshop on Data Management on New Hardware (DaMoN 2021))
Software prefetching and hardware-based cache allocation techniques (CAT) have been successfully applied in main-memory database engines to fetch data into cache before it is needed and to partition a shared last-level cache (LLC) to prevent concurrent tasks from evicting each other's data. We investigate the interaction of these techniques and demonstrate that while a single prefetching strategy is sufficient, the combination of both techniques is only effective if the cache partitioning strategy adapts the partitioning based on the types of tasks currently sharing an LLC. We present a simple, yet effective, scheme that uses prefetching and adapts cache partition allocations dynamically.
Full Text Available
Memory Mapping and Parallelizing Random Forests for Speed and Cache Efficiency

https://doi.org/10.1145/3458744.3474052

Romero-Gainza, Eduardo; Stewart, Christopher; Li, Angela; Hale, Kyle; Morris, Nathaniel (August 2021, International Workshop on Parallel and Distributed Algorithms for Decision Sciences (PDADS 2021))

Full Text Available
